# AI Style Analysis

Data and code repository for reproducing the experiments from the AI Style project (WashU AI Humanities Lab), specifically those used for our paper in the Harvard Data Science Review: ‘Written in the Style of’: ChatGPT and the Literary Canon (https://doi.org/10.1162/99608f92.6d5fb5ef).

DOI for this code and data repository: https://doi.org/10.7910/DVN/BRK6JL

## Author Information

  Co-Principal Investigator Contact Information
    Name: Gabi Kirilloff
	  Institution: Washington University in St. Louis
	  Institutions ROR: [WashU = https://ror.org/01yc7t268]
	  Email: kirilloff@wustl.edu

  Co-Principal Investigator Contact Information
    Name: Claudia Carroll
	  Institution: Washington University in St. Louis
	  Institutions ROR: [WashU = https://ror.org/01yc7t268]
	  Email: claudiac@wustl.edu
	  OrcID: 0000-0002-1853-9106

  Associate or Co-investigator Contact Information
    Name: Zeina Daboul
	  Institutions ROR: [WashU = https://ror.org/01yc7t268]
	  Email: d.zeina@wustl.edu

  Associate or Co-investigator Contact Information
    Name: Arianwyn Frank
	  Institutions ROR: [WashU = https://ror.org/01yc7t268]
	  Email: a.g.frank@wustl.edu

  Associate or Co-investigator Contact Information
    Name: Razi Khan
	  Institutions ROR: [WashU = https://ror.org/01yc7t268]
	  Email: k.razi@wustl.edu

  Associate or Co-investigator Contact Information
    Name: Marielle Hinrichs-Morrow
	  Institutions ROR: [WashU = https://ror.org/01yc7t268]
	  Email: h.marielle@wustl.edu

  Associate or Co-investigator Contact Information
    Name: Rebecca Weingart
	  Institutions ROR: [WashU = https://ror.org/01yc7t268]
	  Email: r.l.weingart@wustl.edu


## What this repo contains

- `analysis_code/` — scripts and notebooks for data analysis and classification experiments:
  - `1_summary_stats.ipynb` — notebook for generating summary statistics
  - `2_calculating_feature_matrix.ipynb` — notebook for building feature matrices from text data
  - `3_random_forest_gpt4.py` — random forest classification script for distinguishing authentic (human) and synthetic text
  - `4_author_holdout_validation.py` — author holdout validation experiments
- `api_code/` — scripts and code for interacting with the OpenAI GPT API:
  - `prompt_runner.py` — demonstrates how to send prompts to the GPT API, handle batching, and save outputs
  - `readme.md` — additional documentation for the API code
- `data/` — data files used in the experiments:
  - `whole_corpus.csv` — the main corpus dataset. We used the GPT-4 base model (as of the date of collection) to generate the synthetic text. See the associated article and the `api_code/` folder for further details.
  - `top_stops_new.txt` — stop words list for text processing
- Root-level files:
  - `requirements.txt` — Python package dependencies
  - `CITATION.cff` — citation information for the repository
  - `LICENSE` — CC0 1.0 license
  - `NOTICE` — additional notices and attributions


To replicate the analysis in the paper, run the notebooks and Python scripts in `analysis_code/` in the order given by the numbers in their filenames. The CSV and TXT files in `data/` contain the data needed to run them.
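As a small illustration of the run order, the sketch below sorts files by the number at the start of their names. The helper `numbered_order` is hypothetical and not part of the repository; it only assumes the `<number>_<name>` filename pattern shown in the listing above.

```python
import re
from pathlib import Path


def numbered_order(paths):
    """Return the paths whose filename starts with '<number>_', sorted by that number."""
    numbered = []
    for path in map(Path, paths):
        match = re.match(r"(\d+)_", path.name)
        if match:  # files without a numeric prefix are skipped
            numbered.append((int(match.group(1)), path))
    return [path for _, path in sorted(numbered, key=lambda item: item[0])]


if __name__ == "__main__":
    folder = Path("analysis_code")
    # Print the files in the order they should be run, if the folder exists.
    for path in numbered_order(folder.iterdir() if folder.is_dir() else []):
        print(path.name)
```

Sorting on the parsed integer rather than the raw string keeps the order correct if a step numbered 10 or higher is ever added.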


## Date of data collection

May 2024 to March 2025


## Python version & environment
- Recommended Python: 3.10 or 3.11. (The code uses modern pandas and scikit-learn APIs; Python 3.8 may work but 3.10+ is recommended.)
- Install required Python packages from `requirements.txt` in this repository.

- If you plan to use the `api_code/` folder to call the GPT API, ensure the OpenAI client (or your chosen API client) is installed. 

Quick setup (recommended in a virtual environment):

```bash
# create & activate a venv (macOS/Linux)
python3 -m venv .venv
source .venv/bin/activate

# install dependencies
pip install --upgrade pip
pip install -r requirements.txt

# install OpenAI client and optional dotenv support
pip install openai python-dotenv

# Set your OpenAI API key
export OPENAI_API_KEY="sk-<your-key-here>"

```
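Before calling the API, it can help to confirm that the key is actually visible to Python. The following is a minimal sketch, not code from `api_code/`: the function `load_api_key` is a hypothetical helper, and the optional `load_dotenv()` call assumes you keep the key in a local `.env` file.

```python
import os


def load_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it or add it to a .env file first."
        )
    return key


if __name__ == "__main__":
    try:
        from dotenv import load_dotenv  # optional, provided by python-dotenv

        load_dotenv()  # reads a .env file in the working directory, if present
    except ImportError:
        pass  # fall back to variables already exported in the shell

    try:
        load_api_key()
        print("OPENAI_API_KEY found.")  # the key itself is never printed
    except RuntimeError as err:
        print(err)
```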

## System requirements (CPU/GPU)
- GPU is not required. The code uses scikit-learn's RandomForest implementation which runs on CPU. The scripts will run on a regular laptop or server without CUDA/OpenCL.
- For faster runs, prefer a multi-core CPU (4+ cores). The scripts retain their original RandomForest settings: an earlier change that trained on all cores (`n_jobs=-1`) has been reverted.

- Network & API access: the `api_code/` examples call the OpenAI GPT API, so they require a valid OpenAI account with an API key and may incur usage charges.


## Run duration
Estimated run durations depend on dataset size, number of features, `CONFIG['n_runs']`, and `CONFIG['n_estimators']`. Expect runtime to scale roughly linearly with `n_runs`.

- Small run (a single author or a few hundred rows, `n_runs=1`, `n_estimators=100`): typically completes in seconds to a couple of minutes on a modern laptop (4+ cores).
- Full default run (repository defaults such as `n_runs=100` across many authors): can take tens of minutes to multiple hours on a single machine, depending on dataset size and CPU.
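Because runtime scales roughly linearly with `n_runs`, a short pilot can give a rough estimate for a full run. The sketch below is only an illustration: `estimate_total_seconds` is a hypothetical helper, and the timing figures in the example are made up, not measurements from this repository.

```python
def estimate_total_seconds(pilot_seconds, pilot_runs, target_runs):
    """Linearly extrapolate wall-clock time from a short pilot to a longer run."""
    if pilot_runs <= 0:
        raise ValueError("pilot_runs must be positive")
    seconds_per_run = pilot_seconds / pilot_runs
    return seconds_per_run * target_runs


if __name__ == "__main__":
    # Hypothetical figures: a pilot with n_runs=2 took 90 seconds,
    # and the full experiment uses n_runs=100.
    total = estimate_total_seconds(pilot_seconds=90, pilot_runs=2, target_runs=100)
    print(f"Estimated full run: {total / 60:.0f} minutes")
```

Treat the result as an approximation, since dataset size and `n_estimators` also affect the time per run.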

## Reproducibility & requirements
- A `requirements.txt` file is included in the repo root listing the Python packages used by the scripts and notebooks.

## How to run the main scripts
- Edit `analysis_code/3_random_forest_gpt4.py` or `analysis_code/4_author_holdout_validation.py` to adjust `CONFIG` (paths, `n_runs`, `sample_size_per_category`, etc.).
- Ensure the input CSV (`data/master_feature_matrix.csv`) is present and formatted with the expected columns (the scripts expect columns such as `id`, `author`, `model`, `category`, and the features listed in `feature_cols`).

- Run:

```bash
python analysis_code/3_random_forest_gpt4.py
```

Results are written to the `data/` directory by default (see `CONFIG['output_dir']`).
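Before launching a long run, the header of the input CSV can be checked against the expected columns. This is a minimal sketch using only the standard library: `missing_columns` is a hypothetical helper, and it checks only the four identifier columns named above, since the contents of `feature_cols` are defined in the scripts themselves.

```python
import csv
import os

# Identifier columns named in this README; feature columns are not checked here.
REQUIRED_COLUMNS = ("id", "author", "model", "category")


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header row."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


if __name__ == "__main__":
    path = "data/master_feature_matrix.csv"
    if os.path.exists(path):
        # utf-8-sig also handles files saved with a byte-order mark
        with open(path, newline="", encoding="utf-8-sig") as handle:
            missing = missing_columns(handle)
        print(f"Missing columns: {missing}" if missing else "All required columns present.")
    else:
        print(f"{path} not found.")
```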


## Citation
This repository contains code and data that accompany the article:

Kirilloff, G., Carroll, C., Daboul, Z., Frank, A., Khan, R., Hinrichs-Morrow, M., & Weingart, R. (2025). "'Written in the Style of': ChatGPT and the Literary Canon." Harvard Data Science Review (HDSR). Published Aug 05, 2025. Available: https://doi.org/10.1162/99608f92.6d5fb5ef

If you use the repository code or data, cite this repository; if you use the article or its findings, cite the HDSR article. Example citations:

- For the article:
  Kirilloff, G., et al. (2025). 'Written in the Style of': ChatGPT and the Literary Canon. Harvard Data Science Review. https://hdsr.mitpress.mit.edu/pub/pyo0xs3k/release/1
- For the software (this repository):
  Carroll, C., & Kirilloff, G. (2025). AI Style Analysis (Version 1.0.0) [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.15587211

## License
This repository (code and data) is released under the CC0 1.0 Universal (Public Domain Dedication) license. See the `LICENSE` file for details or visit https://creativecommons.org/publicdomain/zero/1.0/

You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

Notes on related work:
- The HDSR article is published under a Creative Commons Attribution 4.0 International (CC BY 4.0) license (see the article page). Please respect the article's CC BY 4.0 terms when reusing article text or figures: attribution is required.


